Podcast Reviews Data Analysis

Table of contents

  • Project Brief
  • Business Stakeholders and Objectives
    • Stakeholder Identification
    • Stakeholder Objectives
  • Notebook Initialization and Exploratory Data Analysis (EDA)
    • Setting up the Coding Environment
      • Importing Libraries
      • Importing Functions
    • Loading the Data
      • Downloading Database
      • Data Brief
      • Cleaning the Data
    • Data Exploration Overview
      • Preliminary Plan for Data Exploration
      • Basic Exploration
      • Detailed Exploration
        • Data Sampling
        • Trends Characteristics Analysis
        • Statistical Interference
  • Conclusions
    • Summary
    • Insights
    • Potential Areas for Investigation

¶

Project Brief

This database for this project is the iTunes Podcast Reviews, sourced from the Scraped iTunes podcast review RSS feeds. It contains the information that spans from 2019 to 2023 across USA, offering valuable time-series data in a specific location.

¶

Business Stakeholders and Objectives

¶

Stakeholder Identification

  • Podcast creators / Podcast sponsors: Interested in understanding audience preferences and improving their content based on feedback.
  • Marketing teams: Aim to identify effective marketing strategies and optimize promotional efforts.
  • Data analysts: Responsible for extracting insights from the data to inform decision-making.
¶

Stakeholder Objectives

  • Podcast creators: To identify popular podcast genres/topics and areas for content improvement through analysis of listener engagement and feedback.
  • Marketing teams: To analyze trends in podcast listenership and sentiment to inform marketing campaigns and target audience outreach.
  • Data analysts: To conduct thorough exploratory analysis to extract actionable insights from the podcast reviews dataset.

⇡¶

Notebook Initialization and Exploratory Data Analysis (EDA)

¶

Setting up the Coding Environment

¶

Importing Libraries, and Credentials

In [1]:
# %pip install python-dotenv pandas kaggle numpy textblob nbformat scipy scikit-learn statsmodels sqlalchemy plotly
In [2]:
# Terminal commands:
# conda activate rapids-24.02
# conda install -c conda-forge python-dotenv pandas numpy textblob nbformat scipy scikit-learn statsmodels sqlalchemy plotly

# Install RAPIDS (WSL PowerShell):
# conda create --solver=libmamba -n rapids-24.02 -c rapidsai -c conda-forge -c nvidia  \
#     rapids=24.02 python=3.10 cuda-version=12.0 \
#     jupyterlab dash
In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

from numba import cuda

cuda.detect()

from sqlalchemy import create_engine
import sqlite3
from math import sqrt

%load_ext cudf.pandas
import pandas as pd
import numpy as np
import cudf
from concurrent.futures import ThreadPoolExecutor
from textblob import TextBlob
from collections import Counter

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

nltk.download("vader_lexicon")
nltk.download("punkt")

from scipy.stats import (
    pointbiserialr,
    f_oneway,
    chi2_contingency,
    norm,
    spearmanr,
    t,
    pearsonr,
    ttest_ind,
)
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp

pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
Found 1 CUDA devices
id 0    b'NVIDIA GeForce RTX 2060'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-d5fffe5d-eda6-e044-a96d-7e29d7648f51
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
	1/1 devices are supported
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/cannelle/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/cannelle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
¶

Importing Functions

In [4]:
from utils.functions import (
    print_missing_and_duplicates,
    map_to_general_category,
    analyze_sentiment,
)

⇡¶

Loading the Data

¶

Downloading Database

In [5]:
# kaggle_json_filename = "kaggle.json"
# notebook_directory = os.getcwd()
# kaggle_json_path = os.path.join(notebook_directory, kaggle_json_filename)

# if os.path.exists(kaggle_json_path):
#     os.environ['KAGGLE_CONFIG_DIR'] = notebook_directory
#     import kaggle
# else:
#     print("Error: kaggle.json file not found in the project root directory.")

# kaggle.api.authenticate()
# kaggle.api.dataset_download_files(dataset="thoughtvector/podcastreviews", path="./datasets", unzip=True)

# download_path = "./datasets"

# old_file_path = os.path.join(download_path, "database.db")
# new_file_path = os.path.join(download_path, "database.sqlite")

# if os.path.exists(old_file_path):
#     os.rename(old_file_path, new_file_path)
In [6]:
cnx = sqlite3.connect("./datasets/database.sqlite")
df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)

print(df)
         name
0        runs
1    podcasts
2  categories
3     reviews
In [7]:
categories = pd.read_sql_query("SELECT * FROM Categories", cnx)
podcasts = pd.read_sql_query("SELECT * FROM Podcasts", cnx)
reviews = pd.read_sql_query("SELECT * FROM Reviews", cnx)
runs = pd.read_sql_query("SELECT * FROM Runs", cnx)

display(categories.head(2))
display(podcasts.head(2))
display(reviews.head(2))
display(runs.head(2))
podcast_id category
0 c61aa81c9b929a66f0c1db6cbe5d8548 arts
1 c61aa81c9b929a66f0c1db6cbe5d8548 arts-performing-arts
podcast_id itunes_id slug itunes_url title
0 a00018b54eb342567c94dacfb2a3e504 1313466221 scaling-global https://podcasts.apple.com/us/podcast/scaling-... Scaling Global
1 a00043d34e734b09246d17dc5d56f63c 158973461 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcast/cornerst... Cornerstone Baptist Church of Orlando
podcast_id title content rating author_id created_at
0 c61aa81c9b929a66f0c1db6cbe5d8548 really interesting! Thanks for providing these insights. Really e... 5 F7E5A318989779D 2018-04-24T12:05:16-07:00
1 c61aa81c9b929a66f0c1db6cbe5d8548 Must listen for anyone interested in the arts!!! Super excited to see this podcast grow. So man... 5 F6BF5472689BD12 2018-05-09T18:14:32-07:00
run_at max_rowid reviews_added
0 2021-05-10 02:53:00 3266481 1215223
1 2021-06-06 21:34:36 3300773 13139
¶

Data Brief

There are 4 tables:

  1. Categories - Categories data [podcast_id, category]
  2. Podcasts - Podcasts data [podcast_id, itunes_id, slug, itunes_url, title]
  3. Reviews - Reviews data [podcast_id, title, content, rating, author_id, created_at]
  4. Runs - Runs data [run_at, max_rowid, reviews_added]
¶

Cleaning the Data

Checking for duplicates, and missing values

Missing Values and Duplicates Across Columns
JUSTIFICATION FOR PROCESSING:
  • Missing values in a dataset can lead to inaccurate or misleading statistics and machine learning model predictions. They can occur due to various reasons such as data entry errors, failure to collect information, etc. Depending on the nature and extent of these missing values, different strategies can be employed to handle them.
  • Duplicate values in a dataset can occur due to various reasons such as data entry errors, merging of datasets, etc. Duplicates can lead to biased or incorrect results in data analysis. Therefore, it’s important to identify and remove duplicates.
In [8]:
print_missing_and_duplicates(categories, "Categories")
print_missing_and_duplicates(podcasts, "Podcasts")
print_missing_and_duplicates(reviews, "Reviews")
print_missing_and_duplicates(runs, "Runs")
Duplicates in Reviews table:
655
Drop Duplicative Rows:¶
In [9]:
reviews = reviews.drop_duplicates()

Chosen Strategy for Organizing Tables:¶

1. Merging Tables:

  • The three tables are merged based on the podcast_id value.
  • The rows are sorted based on this value.
  • Merging the tables has been chosen to streamline the workflow to facilitate the use of data for various comparisons.
In [10]:
merged_table = pd.merge(categories, podcasts, on="podcast_id", how="outer")
merged_table = pd.merge(merged_table, reviews, on="podcast_id", how="outer")

merged_table.sort_values(by=["podcast_id"], inplace=True)
merged_table = merged_table.reset_index(drop=True)

2. Preparing the data for display:

  • Mapping the category values to fit into one of the categories - Business & Finance, Religion & Spirituality, News & Politics, Sports & Recreation, Arts, Education, Society & Culture, TV & Film, Health & Fitness, Others, Music, True Crime, Comedy, History, Leisure, Kids & Family, Science, Fiction, Technology, Government
In [11]:
processed_dataset = merged_table.copy()
processed_dataset["category"] = processed_dataset["category"].apply(
    map_to_general_category
)
processed_dataset["podcast_title"] = (
    processed_dataset["title_x"].fillna("")
    + " "
    + processed_dataset["title_y"].fillna("")
)

processed_dataset.head(2)
Out[11]:
podcast_id category itunes_id slug itunes_url title_x title_y content rating author_id created_at podcast_title
0 a00018b54eb342567c94dacfb2a3e504 Business & Finance 1.313466e+09 scaling-global https://podcasts.apple.com/us/podcast/scaling-... Scaling Global Very informative Great variety of speakers! 5 CC47C85896D423B 2017-11-29T12:16:43-07:00 Scaling Global Very informative
1 a00043d34e734b09246d17dc5d56f63c Religion & Spirituality 1.589735e+08 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcast/cornerst... Cornerstone Baptist Church of Orlando Good Sernons I'm a regular listener. I only wish that the ... 5 103CC9DA2046218 2019-10-08T04:23:32-07:00 Cornerstone Baptist Church of Orlando Good Ser...

OUTCOMES:

  • No missing values were found in the datasets.
  • 655 duplicates were found in the Reviews table and were subsequently dropped.
  • The decision was made to reorganize the data into one table, as this could potentially facilitate analysis in further steps.


⇡¶

Data Exploration Overview

Scrutinizing the dataset to identify key patterns, relationships, and trends. This process aids in detecting significant variables and anomalies, leading to more accurate predictions and insights.

¶

Preliminary Plan for Data Exploration

Basic Exploration

  1. Utilize the describe function to provide an overview of numerical and categorical features in each dataset.
  2. Check the distributions of podcasts over categories, ratings, and number of reviews over time.

Detailed Exploration

  1. Data Sampling:

    • Preparing a subset of data
  2. Trends Characteristics Analysis:

    • Understanding podcast listenership trends
    • Identifying popular podcast genres/topics
    • Analyzing sentiment of podcast reviews
  3. Statistical Interference:

    • Correlation between average ratings and voting counts
    • Variances in rating averages across podcast categories
    • Monthly variations in rating averages
¶

Basic Exploration

Summary of Dataset Features
JUSTIFICATION FOR PROCESSING: The processing of dataset features is essential for gaining insights into the underlying data distribution, identifying patterns, and facilitating informed decision-making.
In [12]:
print("Processed Dataset")
display(processed_dataset.describe(include=["object"]).T)
Processed Dataset
count unique top freq
podcast_id 4552196 111544 bf5bf76d5b6ffbf9a31bba4480383b7f 33100
category 4552196 20 Society & Culture 661552
slug 4527973 108919 crime-junkie 33100
itunes_url 4527973 110024 https://podcasts.apple.com/us/podcast/crime-ju... 33100
title_x 4527973 109274 Crime Junkie 33100
title_y 4552196 1138688 Great podcast 30828
content 4552196 2049707 I love this podcast! 404
author_id 4552196 1475285 D3307ADEFFA285C 1660
created_at 4552196 2054352 2017-09-19T08:29:49-07:00 14
podcast_title 4552196 1868684 Crime Junkie Obsessed 466
In [13]:
unique_ratings = processed_dataset["rating"].unique()
count_ratings = len(unique_ratings)
top_rating = processed_dataset["rating"].mode().values[0]
top_rating_freq = processed_dataset["rating"].value_counts().max()
total_ratings = processed_dataset["rating"].count()

print("Unique Ratings:", unique_ratings)
print("Count of Unique Ratings:", count_ratings)
print("Most Common Rating (Top):", top_rating)
print("Frequency of Most Common Rating:", top_rating_freq)
print("Total Number of Ratings:", total_ratings)
Unique Ratings: [5 1 4 2 3]
Count of Unique Ratings: 5
Most Common Rating (Top): 5
Frequency of Most Common Rating: 3982850
Total Number of Ratings: 4552196
Data Distribution Check
JUSTIFICATION FOR PROCESSING: The processing of this data is crucial for gaining insights into the trends and patterns across the database.
In [14]:
category_counts = processed_dataset["category"].value_counts()

fig = px.bar(
    x=category_counts.index,
    y=category_counts.values,
    labels={"x": "Category", "y": "Count"},
)

fig.update_layout(
    title="Podcast Counts by Category",
    xaxis_title="Category",
    yaxis_title="Podcast Count",
    template="plotly_dark",
)

fig.show()
In [15]:
rating_counts = processed_dataset["rating"].value_counts()

fig = px.bar(
    x=rating_counts.index, y=rating_counts.values, labels={"x": "Rating", "y": "Count"}
)

fig.update_layout(
    title="Podcast Counts by Rating",
    xaxis_title="Rating",
    yaxis_title="Podcast Count",
    template="plotly_dark",
)

fig.show()
In [16]:
runs["run_date"] = pd.to_datetime(runs["run_at"]).dt.date

reviews_added_per_day = runs.groupby("run_date")["reviews_added"].sum().reset_index()

fig = px.line(
    reviews_added_per_day,
    x="run_date",
    y="reviews_added",
    title="Reviews Added Over Time",
    template="plotly_dark",
)

fig.update_xaxes(title="Date")
fig.update_yaxes(title="Number of Reviews Added")

fig.show()

OUTCOMES:

  1. Podcast Counts by Category: The Society & Culture category has the highest number of podcasts (661,552k), followed by Business & Finance (435,586k) and Comedy (413,024k). Conversely, the categories with the fewest podcasts are Government (15,483k), Others (25,906k), and Technology (47,808k).
  2. Podcast Counts by Rating: The majority of podcasts received a rating of 5 (3.98 million), indicating the highest level of satisfaction, while the fewest were rated as 2 (94.62k) on a scale of 1-5.
  3. Reviews Added Over Time: The highest number of reviews were recorded on May 10, 2021 (exceeding 1.2 million), followed by July 3, 2022 (559,523k).

¶

Detailed Exploration

⇡¶

Data Sampling
In [17]:
total_rows = len(processed_dataset)

sampling_percentage = 0.1  # 10%
sample_size = int(total_rows * sampling_percentage)
sampled_data = processed_dataset.sample(n=sample_size, random_state=42)

display(sampled_data.head(2))
podcast_id category itunes_id slug itunes_url title_x title_y content rating author_id created_at podcast_title
952553 b50cbb051c6f5b6bb196a12b7a4dc740 Education 1.115025e+09 investing-in-real-estate-clayton-morris-build-... https://podcasts.apple.com/us/podcast/investin... Investing in Real Estate with Clayton Morris |... This podcast is the reason I own rentals I started listening to this podcast in May of ... 5 7BA3C402329DC10 2019-08-02T18:13:11-07:00 Investing in Real Estate with Clayton Morris |...
1952455 c7c948797b0f044fa5e7bb17068fe734 Kids & Family 1.148570e+09 your-parenting-mojo-respectful-research-based-... https://podcasts.apple.com/us/podcast/your-par... Your Parenting Mojo - Respectful, research-bas... I love this show! I basically listen right when new epiodes come... 5 44C9191892B0C4F 2017-07-03T15:26:12-07:00 Your Parenting Mojo - Respectful, research-bas...
In [18]:
# Database Preparation for Export to Looker

# max_file_size_mb = 99
# sampling_percentage = 0.1
# sampling_percentage_step = 0.005

# export_data = None
# file_size_mb = float('inf')

# while sampling_percentage > 0:
#     export_sample_size = int(len(processed_dataset) * sampling_percentage)
#     sampled_data = processed_dataset.sample(n=export_sample_size, random_state=42)

#     sampled_data.to_csv('export_data.csv', index=False)
#     file_size_mb = os.path.getsize('./datasets/export_data.csv') / (1024 * 1024)

#     if file_size_mb <= max_file_size_mb:
#         export_data = sampled_data
#         print(f"Sampled data exported with a file size of {file_size_mb:.2f} MB.")
#         break

#     sampling_percentage -= sampling_percentage_step

# if sampling_percentage <= 0:
#     print("Could not achieve the desired file size within the given constraints.")

# print(f"Final sampling percentage used: {sampling_percentage:.3f}")

⇡¶

Trends Characteristics Analysis

Understanding podcast listenership trends

In [19]:
most_rated_query_df = (
    sampled_data.groupby(["podcast_title"])
    .agg({"rating": ["count", "mean"]})
    .reset_index()
)
most_rated_query_df.columns = ["podcast_title", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
).head(10)

most_rated_query_df_count = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
    by="avg_rating", ascending=False
)

fig1 = px.bar(
    most_rated_query_df_count,
    x="podcast_title",
    y="rating_count",
    title="Top 10 Podcasts by Review Frequency",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Title")
fig1.update_yaxes(title="Number of Reviews Received")

fig2 = px.bar(
    best_rated_query_df_avg,
    x="podcast_title",
    y="avg_rating",
    title="Top 10 Podcasts by Average Rating",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Title")
fig2.update_yaxes(title="Average Rating")

fig1.show()
fig2.show()

Identifying popular podcast genres/topics

In [20]:
most_rated_query_df = (
    sampled_data.groupby(["category"]).agg({"rating": ["count", "mean"]}).reset_index()
)
most_rated_query_df.columns = ["category", "rating_count", "avg_rating"]
most_rated_query_df = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
).head(10)

most_rated_query_df_count = most_rated_query_df.sort_values(
    by="rating_count", ascending=False
)
best_rated_query_df_avg = most_rated_query_df.sort_values(
    by="avg_rating", ascending=False
)

fig1 = px.bar(
    most_rated_query_df_count,
    x="category",
    y="rating_count",
    title="Top 10 Categories by Review Frequency",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig1.update_xaxes(title="Podcast Category")
fig1.update_yaxes(title="Number of Reviews Received")

fig2 = px.bar(
    best_rated_query_df_avg,
    x="category",
    y="avg_rating",
    title="Top 10 Categories by Average Rating",
    template="plotly_dark",
    hover_data={"rating_count": True, "avg_rating": True},
)
fig2.update_xaxes(title="Podcast Category")
fig2.update_yaxes(title="Average Rating")

fig1.show()
fig2.show()

Analyzing sentiment of podcast reviews

In [21]:
sampled_data["sentiment"] = sampled_data["content"].apply(analyze_sentiment)

fig = px.histogram(
    sampled_data,
    x="sentiment",
    nbins=30,
    title="Sentiment Distribution of Podcast Reviews",
)
fig.update_layout(
    xaxis_title="Sentiment Polarity",
    yaxis_title="Frequency",
    bargap=0.05,
    template="plotly_dark",
)
fig.show()

OUTCOMES:

  • The most rated podcasts were "Crime Junkie Obsessed" with 48 ratings, "Wow in the World Love it" with 45 ratings, and "Wow in the World Awesome" with 41 ratings. These podcasts had respective average ratings of 5.00, 4.98, and 4.93.

  • The podcasts with the highest average ratings were "Crime Junkie Obsessed," "Daebak Show w/ Eric Nam Amazing," "Crime Junkie Amazing," and "Crime Junkie Love it," all achieving a perfect average rating of 5.0.

  • The most rated podcast category was "Society & Culture," which amassed over 66 thousand ratings out of more than 4.5 million total ratings, accounting for approximately 1.45% of the total. Other highly rated categories include Business and Finance, with over 43 thousand ratings (0.94% of the total), and Comedy, with more than 41 thousand ratings (0.9%).

  • Podcasts in the Business & Finance category achieved the highest average ratings, with an impressive average of 4.85. Following closely are podcasts in the Religion & Spirituality category and Education category, both with an average rating of 4.83.

  • The sentiment analysis indicates a predominantly positive sentiment in podcast reviews, with the majority of ratings falling within the positive range.

INSIGHTS:

  • The most rated podcasts have accumulated more than 40 ratings each, with highly favorable average ratings ranging between 5.00 and 4.93.

  • Podcasts consistently rated with an average of 5.0 not only attract a substantial audience but also consistently deliver content that resonates exceptionally well, resulting in consistently high average ratings across episodes.

  • While "Society & Culture" may have the most rated podcasts, along with Business and Finance, and Comedy attracting a significant number of ratings, it's noteworthy that podcasts in the Business & Finance category receive the highest average ratings. This suggests that listeners highly appreciate the quality and value of content offered in this genre. Similarly, Religion & Spirituality and Education categories also boast exceptionally high average ratings, indicating strong listener satisfaction in these areas.

  • The distribution of sentiment analysis highlights a predominantly positive sentiment in podcast reviews, indicating high overall satisfaction among listeners.

⇡¶
Statistical Inference
Target Population: The target population consists of all podcast reviews available in the dataset. However, for the purpose of statistical inference, we are working with a sample of the data rather than the entire population.

Significance Levels:
The chosen significance level for hypothesis testing α = 0.05.


Podcast Ratings VS Voter Count Confidence Intervals:
The confidence interval for the count of ratings is [-15.969287064551526, 3480.769498138696]. This interval provides a range of plausible values for the true population parameter, the total count of ratings for all podcasts in the dataset, with a specified level of confidence 95%. The lower bound of the confidence interval -15.969287064551526 represents the lower estimate of the count of ratings, while the upper bound 3480.769498138696 represents the upper estimate. We can be 95% confident that the true count of ratings falls within this interval.

In [22]:
sample_size = sampled_data.groupby("podcast_id")["rating"].count()
confidence_level = 0.95

t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
standard_error = np.sqrt(sample_size)

small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0

margin_of_error = t_value * standard_error
confidence_interval_counts = (
    sample_size - margin_of_error,
    sample_size + margin_of_error,
)

print("Confidence Interval for Count of Ratings:")
display(confidence_interval_counts[:5])

lower_bounds = confidence_interval_counts[0]
upper_bounds = confidence_interval_counts[1]

overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)

print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Count of Ratings:
(podcast_id
 a00018b54eb342567c94dacfb2a3e504     1.000000
 a00071f9aaae9ac725c3a586701abf4d   -15.969287
 a000aa69852b276565c4f5eb9cdd999b    -1.208320
 a0010b283ba17d282c7bb1f9709f0ac3     1.000000
 a0013c50c1e6b24266fdeb10eed6eea7     1.000000
                                       ...    
 fffeb7d6d05f2b4c600fbebc828ca656     5.916641
 ffff09ad9a175a57b1bbbdb3c1581ec0     1.000000
 ffff1a7b221753187b1562bf638010fa     1.000000
 ffff5db4b5db2d860c49749e5de8a36d    -4.452413
 ffff66f98c1adfc8d0d6c41bb8facfd0     1.000000
 Name: rating, Length: 53240, dtype: float64,
 podcast_id
 a00018b54eb342567c94dacfb2a3e504     1.000000
 a00071f9aaae9ac725c3a586701abf4d    19.969287
 a000aa69852b276565c4f5eb9cdd999b    11.208320
 a0010b283ba17d282c7bb1f9709f0ac3     1.000000
 a0013c50c1e6b24266fdeb10eed6eea7     1.000000
                                       ...    
 fffeb7d6d05f2b4c600fbebc828ca656    22.083359
 ffff09ad9a175a57b1bbbdb3c1581ec0     1.000000
 ffff1a7b221753187b1562bf638010fa     1.000000
 ffff5db4b5db2d860c49749e5de8a36d    10.452413
 ffff66f98c1adfc8d0d6c41bb8facfd0     1.000000
 Name: rating, Length: 53240, dtype: float64)
Overall Lower Bound: -15.969287064551526
Overall Upper Bound: 3480.769498138696

Statistical Hypotheses:
Null Hypothesis (H0): There is no difference in average ratings between podcasts and the number of people voting for them.
Alternative Hypothesis (H1): There is a difference in average ratings between podcasts and the number of people voting for them.

Hypothesis Testing:
To test the hypothesis regarding the difference in average ratings between podcasts and the number of people voting for them, the chi-square test was conducted. The test statistic obtained was 14814.31676859763, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis. This indicates that there is a sufficient evidence to conclude that there is a difference in average ratings between podcasts and the number of people voting for them.

In [23]:
contingency_table = pd.crosstab(sampled_data["category"], sampled_data["rating"])

chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)

alpha = 0.05
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_val)
print("Degrees of Freedom:", dof)

if p_val < alpha:
    print(
        "Reject the null hypothesis. There is a significant association between ratings and podcast categories."
    )
else:
    print(
        "Fail to reject the null hypothesis. There is no significant association between ratings and podcast categories."
    )

print("Expected Frequencies Table:")
print(expected[:5])
Chi-square Statistic: 14814.31676859763
P-value: 0.0
Degrees of Freedom: 76
Reject the null hypothesis. There is a significant association between ratings and podcast categories.
Expected Frequencies Table:
[[ 1314.57588545   528.51554966   588.06895143   729.23669706
  22199.6029164 ]
 [ 2236.23042975   899.05996894  1000.36650491  1240.50753593
  37763.83556047]
 [ 2134.83806256   858.29591471   955.00913626  1184.26199258
  36051.59489389]
 [ 1860.36332622   747.94536915   832.22423493  1032.00220114
  31416.46486856]
 [  334.55334246   134.50470653   149.66076548   185.58728876
   5649.69389678]]

Category Rating Differences Confidence Intervals:
The confidence interval for the mean of ratings is [-22.412409472864187, 28.412409472864187]. This interval provides a range of plausible values for the true population parameter, the total mean of ratings for all podcasts in the dataset, with a specified level of confidence 95%. The lower bound of the confidence interval -22.412409472864187 represents the lower estimate of the mean of ratings, while the upper bound 28.412409472864187 represents the upper estimate. We can be 95% confident that the true count of ratings falls within this interval.

In [24]:
mean_ratings = sampled_data.groupby("podcast_id")["rating"].mean()
sample_size = sampled_data.groupby("podcast_id")["rating"].count()
sample_std = sampled_data.groupby("podcast_id")["rating"].std()

confidence_level = 0.95
t_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

zero_std_mask = sample_std == 0
sample_std[zero_std_mask] = np.nan
sample_std_filled = sample_std.fillna(0)

small_sample_mask = sample_size < 2
t_value[small_sample_mask] = 0
sample_std[small_sample_mask] = np.nan
sample_std_filled = sample_std.fillna(0)

margin_of_error = t_value * sample_std_filled / np.sqrt(sample_size)
confidence_interval_means = (
    mean_ratings - margin_of_error,
    mean_ratings + margin_of_error,
)

print("Confidence Interval for Mean Rating:")
display(confidence_interval_means[:5])

lower_bounds = confidence_interval_means[0]
upper_bounds = confidence_interval_means[1]

overall_lower_bound = np.min(lower_bounds)
overall_upper_bound = np.max(upper_bounds)

print("Overall Lower Bound:", overall_lower_bound)
print("Overall Upper Bound:", overall_upper_bound)
Confidence Interval for Mean Rating:
(podcast_id
 a00018b54eb342567c94dacfb2a3e504    5.000000
 a00071f9aaae9ac725c3a586701abf4d    5.000000
 a000aa69852b276565c4f5eb9cdd999b    5.000000
 a0010b283ba17d282c7bb1f9709f0ac3    5.000000
 a0013c50c1e6b24266fdeb10eed6eea7    5.000000
                                       ...   
 fffeb7d6d05f2b4c600fbebc828ca656    3.159423
 ffff09ad9a175a57b1bbbdb3c1581ec0    5.000000
 ffff1a7b221753187b1562bf638010fa    1.000000
 ffff5db4b5db2d860c49749e5de8a36d    3.232449
 ffff66f98c1adfc8d0d6c41bb8facfd0    5.000000
 Name: rating, Length: 53240, dtype: float64,
 podcast_id
 a00018b54eb342567c94dacfb2a3e504    5.000000
 a00071f9aaae9ac725c3a586701abf4d    5.000000
 a000aa69852b276565c4f5eb9cdd999b    5.000000
 a0010b283ba17d282c7bb1f9709f0ac3    5.000000
 a0013c50c1e6b24266fdeb10eed6eea7    5.000000
                                       ...   
 fffeb7d6d05f2b4c600fbebc828ca656    5.126291
 ffff09ad9a175a57b1bbbdb3c1581ec0    5.000000
 ffff1a7b221753187b1562bf638010fa    1.000000
 ffff5db4b5db2d860c49749e5de8a36d    6.100884
 ffff66f98c1adfc8d0d6c41bb8facfd0    5.000000
 Name: rating, Length: 53240, dtype: float64)
Overall Lower Bound: -22.412409472864187
Overall Upper Bound: 28.412409472864187

Statistical Hypotheses:
Null Hypothesis (H0): There are no significant differences in rating averages among categories.
Alternative Hypothesis (H1): There are significant differences in rating averages among categories.

Hypothesis Testing:
To test the hypothesis regarding the difference in average ratings between different podcasts categories, the Tukey's Honestly Significant Difference (HSD) test following an ANOVA were conducted. The test statistic obtained was 703.9440772581569, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis. This indicates that there is a sufficient evidence to conclude that there are differences in average ratings between podcasts categories.

In [25]:
category_groups = sampled_data.groupby("category")["rating"]

f_statistic, p_value = stats.f_oneway(*[group for name, group in category_groups])

alpha = 0.05

print("F-statistic:", f_statistic)
print("P-value:", p_value)

if p_value < alpha:
    print(
        "One-way ANOVA: There are significant differences in rating averages among categories."
    )
    tukey_results = pairwise_tukeyhsd(sampled_data["rating"], sampled_data["category"])
    print(tukey_results.summary())
else:
    print(
        "One-way ANOVA: No significant differences in rating averages among categories were found."
    )
F-statistic: 703.9440772581569
P-value: 0.0
One-way ANOVA: There are significant differences in rating averages among categories.
                 Multiple Comparison of Means - Tukey HSD, FWER=0.05                  
======================================================================================
         group1                  group2         meandiff p-adj   lower   upper  reject
--------------------------------------------------------------------------------------
                   Arts      Business & Finance   0.1126    0.0  0.0847  0.1406   True
                   Arts                  Comedy  -0.1056    0.0 -0.1338 -0.0775   True
                   Arts               Education   0.0943    0.0  0.0654  0.1232   True
                   Arts                 Fiction  -0.1184    0.0 -0.1676 -0.0692   True
                   Arts              Government  -0.2996    0.0 -0.3908 -0.2084   True
                   Arts        Health & Fitness    0.053    0.0  0.0237  0.0822   True
                   Arts                 History  -0.2243    0.0  -0.275 -0.1736   True
                   Arts           Kids & Family   0.0135 0.9985 -0.0215  0.0484  False
                   Arts                 Leisure   0.0215 0.7694 -0.0123  0.0553  False
                   Arts                   Music   0.0437  0.045  0.0004   0.087   True
                   Arts         News & Politics  -0.4029    0.0 -0.4323 -0.3735   True
                   Arts                  Others  -0.0838 0.0089 -0.1576   -0.01   True
                   Arts Religion & Spirituality   0.0947    0.0  0.0643  0.1251   True
                   Arts                 Science  -0.2021    0.0  -0.248 -0.1562   True
                   Arts       Society & Culture   -0.163    0.0 -0.1891  -0.137   True
                   Arts     Sports & Recreation  -0.1035    0.0 -0.1332 -0.0737   True
                   Arts               TV & Film  -0.1885    0.0 -0.2198 -0.1572   True
                   Arts              Technology  -0.1703    0.0 -0.2256  -0.115   True
                   Arts              True Crime  -0.5634    0.0 -0.5988  -0.528   True
     Business & Finance                  Comedy  -0.2183    0.0 -0.2426  -0.194   True
     Business & Finance               Education  -0.0183 0.5222 -0.0435  0.0069  False
     Business & Finance                 Fiction  -0.2311    0.0 -0.2781  -0.184   True
     Business & Finance              Government  -0.4123    0.0 -0.5024 -0.3221   True
     Business & Finance        Health & Fitness  -0.0597    0.0 -0.0853 -0.0341   True
     Business & Finance                 History   -0.337    0.0 -0.3857 -0.2883   True
     Business & Finance           Kids & Family  -0.0992    0.0 -0.1311 -0.0673   True
     Business & Finance                 Leisure  -0.0912    0.0 -0.1219 -0.0605   True
     Business & Finance                   Music  -0.0689    0.0 -0.1099  -0.028   True
     Business & Finance         News & Politics  -0.5155    0.0 -0.5413 -0.4898   True
     Business & Finance                  Others  -0.1964    0.0 -0.2689  -0.124   True
     Business & Finance Religion & Spirituality  -0.0179 0.6849 -0.0448  0.0089  False
     Business & Finance                 Science  -0.3148    0.0 -0.3584 -0.2711   True
     Business & Finance       Society & Culture  -0.2757    0.0 -0.2975 -0.2539   True
     Business & Finance     Sports & Recreation  -0.2161    0.0 -0.2423   -0.19   True
     Business & Finance               TV & Film  -0.3011    0.0  -0.329 -0.2733   True
     Business & Finance              Technology  -0.2829    0.0 -0.3364 -0.2295   True
     Business & Finance              True Crime  -0.6761    0.0 -0.7085 -0.6436   True
                 Comedy               Education      0.2    0.0  0.1745  0.2254   True
                 Comedy                 Fiction  -0.0128    1.0   -0.06  0.0344  False
                 Comedy              Government   -0.194    0.0 -0.2842 -0.1038   True
                 Comedy        Health & Fitness   0.1586    0.0  0.1327  0.1845   True
                 Comedy                 History  -0.1187    0.0 -0.1675 -0.0699   True
                 Comedy           Kids & Family   0.1191    0.0   0.087  0.1512   True
                 Comedy                 Leisure   0.1271    0.0  0.0962   0.158   True
                 Comedy                   Music   0.1493    0.0  0.1083  0.1904   True
                 Comedy         News & Politics  -0.2972    0.0 -0.3232 -0.2712   True
                 Comedy                  Others   0.0219    1.0 -0.0507  0.0944  False
                 Comedy Religion & Spirituality   0.2003    0.0  0.1733  0.2274   True
                 Comedy                 Science  -0.0965    0.0 -0.1403 -0.0527   True
                 Comedy       Society & Culture  -0.0574    0.0 -0.0795 -0.0353   True
                 Comedy     Sports & Recreation   0.0021    1.0 -0.0242  0.0285  False
                 Comedy               TV & Film  -0.0829    0.0  -0.111 -0.0548   True
                 Comedy              Technology  -0.0646 0.0031 -0.1182 -0.0111   True
                 Comedy              True Crime  -0.4578    0.0 -0.4904 -0.4251   True
              Education                 Fiction  -0.2127    0.0 -0.2604  -0.165   True
              Education              Government  -0.3939    0.0 -0.4844 -0.3035   True
              Education        Health & Fitness  -0.0414    0.0 -0.0681 -0.0146   True
              Education                 History  -0.3186    0.0 -0.3679 -0.2694   True
              Education           Kids & Family  -0.0809    0.0 -0.1137 -0.0481   True
              Education                 Leisure  -0.0729    0.0 -0.1045 -0.0413   True
              Education                   Music  -0.0506 0.0027 -0.0922  -0.009   True
              Education         News & Politics  -0.4972    0.0  -0.524 -0.4703   True
              Education                  Others  -0.1781    0.0 -0.2509 -0.1053   True
              Education Religion & Spirituality   0.0004    1.0 -0.0275  0.0283  False
              Education                 Science  -0.2964    0.0 -0.3408 -0.2521   True
              Education       Society & Culture  -0.2573    0.0 -0.2805 -0.2342   True
              Education     Sports & Recreation  -0.1978    0.0  -0.225 -0.1706   True
              Education               TV & Film  -0.2828    0.0 -0.3117 -0.2539   True
              Education              Technology  -0.2646    0.0 -0.3186 -0.2106   True
              Education              True Crime  -0.6577    0.0 -0.6911 -0.6244   True
                Fiction              Government  -0.1812    0.0   -0.28 -0.0824   True
                Fiction        Health & Fitness   0.1714    0.0  0.1235  0.2193   True
                Fiction                 History  -0.1059    0.0 -0.1692 -0.0426   True
                Fiction           Kids & Family   0.1319    0.0  0.0803  0.1834   True
                Fiction                 Leisure   0.1399    0.0  0.0891  0.1906   True
                Fiction                   Music   0.1621    0.0  0.1046  0.2197   True
                Fiction         News & Politics  -0.2845    0.0 -0.3324 -0.2365   True
                Fiction                  Others   0.0346 0.9959 -0.0483  0.1176  False
                Fiction Religion & Spirituality   0.2131    0.0  0.1646  0.2617   True
                Fiction                 Science  -0.0837 0.0001 -0.1432 -0.0242   True
                Fiction       Society & Culture  -0.0446 0.0701 -0.0906  0.0014  False
                Fiction     Sports & Recreation   0.0149 0.9999 -0.0333  0.0631  False
                Fiction               TV & Film  -0.0701 0.0001 -0.1192 -0.0209   True
                Fiction              Technology  -0.0519 0.3971 -0.1189  0.0152  False
                Fiction              True Crime   -0.445    0.0 -0.4969 -0.3931   True
             Government        Health & Fitness   0.3526    0.0   0.262  0.4431   True
             Government                 History   0.0753 0.4438 -0.0243  0.1748  False
             Government           Kids & Family   0.3131    0.0  0.2205  0.4056   True
             Government                 Leisure   0.3211    0.0  0.2289  0.4132   True
             Government                   Music   0.3433    0.0  0.2473  0.4393   True
             Government         News & Politics  -0.1033 0.0083 -0.1939 -0.0127   True
             Government                  Others   0.2158    0.0  0.1027  0.3289   True
             Government Religion & Spirituality   0.3943    0.0  0.3034  0.4852   True
             Government                 Science   0.0975 0.0484  0.0003  0.1947   True
             Government       Society & Culture   0.1366    0.0   0.047  0.2261   True
             Government     Sports & Recreation   0.1961    0.0  0.1054  0.2868   True
             Government               TV & Film   0.1111 0.0026  0.0199  0.2023   True
             Government              Technology   0.1293 0.0012  0.0274  0.2313   True
             Government              True Crime  -0.2638    0.0 -0.3565 -0.1711   True
       Health & Fitness                 History  -0.2773    0.0 -0.3268 -0.2278   True
       Health & Fitness           Kids & Family  -0.0395 0.0038 -0.0726 -0.0064   True
       Health & Fitness                 Leisure  -0.0315 0.0583 -0.0634  0.0004  False
       Health & Fitness                   Music  -0.0092    1.0 -0.0511  0.0326  False
       Health & Fitness         News & Politics  -0.4558    0.0 -0.4831 -0.4286   True
       Health & Fitness                  Others  -0.1367    0.0 -0.2097 -0.0638   True
       Health & Fitness Religion & Spirituality   0.0418    0.0  0.0135    0.07   True
       Health & Fitness                 Science  -0.2551    0.0 -0.2996 -0.2105   True
       Health & Fitness       Society & Culture   -0.216    0.0 -0.2395 -0.1924   True
       Health & Fitness     Sports & Recreation  -0.1564    0.0  -0.184 -0.1288   True
       Health & Fitness               TV & Film  -0.2414    0.0 -0.2707 -0.2122   True
       Health & Fitness              Technology  -0.2232    0.0 -0.2774 -0.1691   True
       Health & Fitness              True Crime  -0.6164    0.0   -0.65 -0.5827   True
                History           Kids & Family   0.2378    0.0  0.1848  0.2908   True
                History                 Leisure   0.2458    0.0  0.1935  0.2981   True
                History                   Music    0.268    0.0  0.2092  0.3269   True
                History         News & Politics  -0.1785    0.0 -0.2281  -0.129   True
                History                  Others   0.1405    0.0  0.0567  0.2244   True
                History Religion & Spirituality    0.319    0.0  0.2689  0.3692   True
                History                 Science   0.0222 0.9993 -0.0386   0.083  False
                History       Society & Culture   0.0613 0.0009  0.0137  0.1089   True
                History     Sports & Recreation   0.1208    0.0  0.0711  0.1706   True
                History               TV & Film   0.0358 0.5793 -0.0148  0.0865  False
                History              Technology   0.0541 0.3489 -0.0141  0.1222  False
                History              True Crime  -0.3391    0.0 -0.3924 -0.2857   True
          Kids & Family                 Leisure    0.008    1.0 -0.0292  0.0452  False
          Kids & Family                   Music   0.0303 0.7125 -0.0157  0.0762  False
          Kids & Family         News & Politics  -0.4163    0.0 -0.4495 -0.3831   True
          Kids & Family                  Others  -0.0972 0.0008 -0.1726 -0.0218   True
          Kids & Family Religion & Spirituality   0.0813    0.0  0.0472  0.1153   True
          Kids & Family                 Science  -0.2156    0.0  -0.264 -0.1672   True
          Kids & Family       Society & Culture  -0.1765    0.0 -0.2067 -0.1462   True
          Kids & Family     Sports & Recreation  -0.1169    0.0 -0.1505 -0.0834   True
          Kids & Family               TV & Film  -0.2019    0.0 -0.2368 -0.1671   True
          Kids & Family              Technology  -0.1837    0.0 -0.2411 -0.1263   True
          Kids & Family              True Crime  -0.5769    0.0 -0.6155 -0.5382   True
                Leisure                   Music   0.0223 0.9726 -0.0229  0.0674  False
                Leisure         News & Politics  -0.4243    0.0 -0.4564 -0.3923   True
                Leisure                  Others  -0.1052 0.0001 -0.1801 -0.0303   True
                Leisure Religion & Spirituality   0.0733    0.0  0.0403  0.1062   True
                Leisure                 Science  -0.2236    0.0 -0.2712  -0.176   True
                Leisure       Society & Culture  -0.1845    0.0 -0.2135 -0.1555   True
                Leisure     Sports & Recreation  -0.1249    0.0 -0.1573 -0.0926   True
                Leisure               TV & Film  -0.2099    0.0 -0.2437 -0.1762   True
                Leisure              Technology  -0.1917    0.0 -0.2485  -0.135   True
                Leisure              True Crime  -0.5849    0.0 -0.6225 -0.5472   True
                  Music         News & Politics  -0.4466    0.0 -0.4885 -0.4046   True
                  Music                  Others  -0.1275    0.0 -0.2071 -0.0478   True
                  Music Religion & Spirituality    0.051 0.0037  0.0084  0.0936   True
                  Music                 Science  -0.2458    0.0 -0.3006  -0.191   True
                  Music       Society & Culture  -0.2067    0.0 -0.2464 -0.1671   True
                  Music     Sports & Recreation  -0.1472    0.0 -0.1894  -0.105   True
                  Music               TV & Film  -0.2322    0.0 -0.2755 -0.1889   True
                  Music              Technology   -0.214    0.0 -0.2769 -0.1511   True
                  Music              True Crime  -0.6071    0.0 -0.6535 -0.5608   True
        News & Politics                  Others   0.3191    0.0  0.2461  0.3921   True
        News & Politics Religion & Spirituality   0.4976    0.0  0.4692   0.526   True
        News & Politics                 Science   0.2007    0.0  0.1561  0.2454   True
        News & Politics       Society & Culture   0.2398    0.0  0.2161  0.2635   True
        News & Politics     Sports & Recreation   0.2994    0.0  0.2717  0.3271   True
        News & Politics               TV & Film   0.2144    0.0   0.185  0.2437   True
        News & Politics              Technology   0.2326    0.0  0.1784  0.2868   True
        News & Politics              True Crime  -0.1606    0.0 -0.1943 -0.1268   True
                 Others Religion & Spirituality   0.1785    0.0  0.1051  0.2519   True
                 Others                 Science  -0.1183    0.0 -0.1994 -0.0373   True
                 Others       Society & Culture  -0.0792 0.0134  -0.151 -0.0075   True
                 Others     Sports & Recreation  -0.0197    1.0 -0.0929  0.0534  False
                 Others               TV & Film  -0.1047 0.0001 -0.1785 -0.0309   True
                 Others              Technology  -0.0865 0.0517 -0.1732  0.0003  False
                 Others              True Crime  -0.4796    0.0 -0.5553  -0.404   True
Religion & Spirituality                 Science  -0.2968    0.0 -0.3421 -0.2516   True
Religion & Spirituality       Society & Culture  -0.2577    0.0 -0.2826 -0.2328   True
Religion & Spirituality     Sports & Recreation  -0.1982    0.0 -0.2269 -0.1695   True
Religion & Spirituality               TV & Film  -0.2832    0.0 -0.3135 -0.2529   True
Religion & Spirituality              Technology   -0.265    0.0 -0.3197 -0.2102   True
Religion & Spirituality              True Crime  -0.6581    0.0 -0.6927 -0.6235   True
                Science       Society & Culture   0.0391 0.1175 -0.0034  0.0816  False
                Science     Sports & Recreation   0.0986    0.0  0.0538  0.1435   True
                Science               TV & Film   0.0136    1.0 -0.0322  0.0595  False
                Science              Technology   0.0319 0.9729 -0.0328  0.0965  False
                Science              True Crime  -0.3613    0.0 -0.4101 -0.3125   True
      Society & Culture     Sports & Recreation   0.0595    0.0  0.0354  0.0837   True
      Society & Culture               TV & Film  -0.0255 0.0626 -0.0515  0.0005  False
      Society & Culture              Technology  -0.0073    1.0 -0.0597  0.0452  False
      Society & Culture              True Crime  -0.4004    0.0 -0.4313 -0.3695   True
    Sports & Recreation               TV & Film   -0.085    0.0 -0.1147 -0.0553   True
    Sports & Recreation              Technology  -0.0668 0.0023 -0.1212 -0.0124   True
    Sports & Recreation              True Crime  -0.4599    0.0  -0.494 -0.4259   True
              TV & Film              Technology   0.0182 0.9998 -0.0371  0.0735  False
              TV & Film              True Crime  -0.3749    0.0 -0.4103 -0.3395   True
             Technology              True Crime  -0.3931    0.0 -0.4509 -0.3354   True
--------------------------------------------------------------------------------------

Temporal Analysis Confidence Intervals:
The confidence interval for the mean rating score is [4.631971527166266, 4.646939640938188]. This interval provides a range of plausible values for the true population parameter, the total mean of ratings for all podcasts in the dataset, with a specified level of confidence of 95%. The lower bound of the confidence interval represents the lower estimate of the mean rating score, while the upper bound represents the upper estimate. We can be 95% confident that the true mean rating score falls within this interval.

In [26]:
sampled_data["created_at"] = pd.to_datetime(sampled_data["created_at"])
sampled_data["month"] = sampled_data["created_at"].dt.month

monthly_rating = sampled_data.groupby("month")["rating"].mean()
mean_rating = monthly_rating.mean()
std_rating = monthly_rating.std()
std_error = std_rating / len(monthly_rating) ** 0.5

# T-score for 95% confidence level
t_score = t.ppf(0.975, df=len(monthly_rating) - 1)

confidence_intervals = []
for month, rating in monthly_rating.items():
    lower_bound = rating - t_score * std_error
    upper_bound = rating + t_score * std_error
    confidence_intervals.append((month, lower_bound, upper_bound))

print("Confidence intervals for mean rating by each month:")
for month, lower, upper in confidence_intervals:
    print(f"Month {month}: ({lower}, {upper})")
Confidence intervals for mean rating by each month:
Month 1: (4.640083007927041, 4.655051121698962)
Month 2: (4.667997531372459, 4.68296564514438)
Month 3: (4.6466592300946035, 4.661627343866525)
Month 4: (4.66141280995477, 4.676380923726692)
Month 5: (4.658119006292087, 4.673087120064008)
Month 6: (4.631688997949905, 4.646657111721827)
Month 7: (4.648462691048918, 4.663430804820839)
Month 8: (4.656061846472697, 4.671029960244618)
Month 9: (4.645996759562044, 4.660964873333965)
Month 10: (4.6458571438811145, 4.660825257653036)
Month 11: (4.634314514342503, 4.649282628114424)
Month 12: (4.631971527166266, 4.646939640938188)

Statistical Hypotheses:
Null Hypothesis (H0): There are specific time periods are associated with higher or lower ratings.
Alternative Hypothesis (H1): There are no specific time periods are associated with higher or lower ratings.

Hypothesis Testing:
To test the hypothesis regarding correlation between reviews time periods and ratings, the one-way ANOVA was checked. The test statistic obtained was 5.048049115266022, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis. This indicates that there is a sufficient evidence to conclude that there are specific time periods are associated with higher or lower ratings.

In [27]:
month_groups = [group["rating"] for _, group in sampled_data.groupby("month")]

f_statistic, p_value = f_oneway(*month_groups)
print(f"P-value is: {p_val}, and test statistics is: {f_statistic}")
if p_value < 0.05:
    print("There are significant differences in ratings among different months.")
else:
    print("No significant differences in ratings among different months were found.")
P-value is: 0.0, and test statistics is: 5.048049115266022
There are significant differences in ratings among different months.
In [28]:
tukey_results = pairwise_tukeyhsd(sampled_data["rating"], sampled_data["month"])
print(tukey_results.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
====================================================
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     1      2   0.0279 0.0058  0.0044  0.0514   True
     1      3   0.0066 0.9991 -0.0171  0.0303  False
     1      4   0.0213 0.1273 -0.0024   0.045  False
     1      5    0.018 0.3398 -0.0055  0.0416  False
     1      6  -0.0084 0.9916  -0.032  0.0152  False
     1      7   0.0084 0.9918 -0.0152   0.032  False
     1      8    0.016 0.5224 -0.0074  0.0393  False
     1      9   0.0059 0.9996 -0.0172  0.0291  False
     1     10   0.0058 0.9997 -0.0173  0.0289  False
     1     11  -0.0058 0.9998 -0.0297  0.0181  False
     1     12  -0.0081 0.9949 -0.0323   0.016  False
     2      3  -0.0213 0.1426 -0.0454  0.0028  False
     2      4  -0.0066 0.9992 -0.0307  0.0175  False
     2      5  -0.0099 0.9729 -0.0339  0.0141  False
     2      6  -0.0363    0.0 -0.0603 -0.0123   True
     2      7  -0.0195 0.2469 -0.0435  0.0045  False
     2      8  -0.0119 0.8933 -0.0357  0.0118  False
     2      9   -0.022 0.0943 -0.0456  0.0016  False
     2     10  -0.0221 0.0877 -0.0457  0.0014  False
     2     11  -0.0337 0.0004  -0.058 -0.0094   True
     2     12   -0.036 0.0001 -0.0606 -0.0115   True
     3      4   0.0148 0.7053 -0.0096  0.0391  False
     3      5   0.0115 0.9268 -0.0127  0.0356  False
     3      6   -0.015  0.679 -0.0392  0.0092  False
     3      7   0.0018    1.0 -0.0224   0.026  False
     3      8   0.0094 0.9813 -0.0145  0.0334  False
     3      9  -0.0007    1.0 -0.0244  0.0231  False
     3     10  -0.0008    1.0 -0.0245  0.0229  False
     3     11  -0.0123 0.8916 -0.0368  0.0122  False
     3     12  -0.0147 0.7342 -0.0394  0.0101  False
     4      5  -0.0033    1.0 -0.0275  0.0209  False
     4      6  -0.0297 0.0035 -0.0539 -0.0055   True
     4      7   -0.013 0.8464 -0.0372  0.0113  False
     4      8  -0.0054 0.9999 -0.0293  0.0186  False
     4      9  -0.0154 0.6101 -0.0392  0.0084  False
     4     10  -0.0156 0.5929 -0.0393  0.0082  False
     4     11  -0.0271  0.016 -0.0516 -0.0026   True
     4     12  -0.0294 0.0058 -0.0542 -0.0047   True
     5      6  -0.0264 0.0175 -0.0505 -0.0023   True
     5      7  -0.0097 0.9781 -0.0337  0.0144  False
     5      8  -0.0021    1.0 -0.0259  0.0218  False
     5      9  -0.0121 0.8799 -0.0358  0.0115  False
     5     10  -0.0123 0.8698 -0.0359  0.0113  False
     5     11  -0.0238 0.0632 -0.0482  0.0006  False
     5     12  -0.0261 0.0262 -0.0508 -0.0015   True
     6      7   0.0168 0.4963 -0.0073  0.0409  False
     6      8   0.0244 0.0401  0.0005  0.0482   True
     6      9   0.0143  0.711 -0.0094   0.038  False
     6     10   0.0142 0.7212 -0.0095  0.0378  False
     6     11   0.0026    1.0 -0.0218   0.027  False
     6     12   0.0003    1.0 -0.0244  0.0249  False
     7      8   0.0076 0.9968 -0.0163  0.0315  False
     7      9  -0.0025    1.0 -0.0262  0.0212  False
     7     10  -0.0026    1.0 -0.0262   0.021  False
     7     11  -0.0141 0.7634 -0.0386  0.0103  False
     7     12  -0.0165 0.5605 -0.0412  0.0082  False
     8      9  -0.0101  0.963 -0.0335  0.0134  False
     8     10  -0.0102 0.9585 -0.0336  0.0132  False
     8     11  -0.0217 0.1262 -0.0459  0.0024  False
     8     12  -0.0241 0.0569 -0.0485  0.0003  False
     9     10  -0.0001    1.0 -0.0233   0.023  False
     9     11  -0.0117 0.9124 -0.0357  0.0123  False
     9     12   -0.014  0.765 -0.0383  0.0102  False
    10     11  -0.0115 0.9179 -0.0355  0.0124  False
    10     12  -0.0139 0.7743 -0.0381  0.0103  False
    11     12  -0.0023    1.0 -0.0273  0.0226  False
----------------------------------------------------
In [29]:
cnx.close()

OUTCOMES:

  • To test the hypothesis regarding the difference in average ratings between podcasts and the number of people voting for them, the chi-square test was conducted. The test statistic obtained was 14814.31676859763, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.
  • To test the hypothesis regarding the difference in average ratings between different podcasts categories, the Tukey's Honestly Significant Difference (HSD) test following an ANOVA was conducted. The test statistic obtained was 703.9440772581569, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.
  • To test the hypothesis regarding the difference in average ratings between different months, an analysis of variance (ANOVA) was conducted. The F-statistic obtained was 5.048049115266022, and the corresponding p-value was 0.0. Using a significance level of α = 0.05, the p-value was compared to the chosen significance level. Based on the results, we reject the null hypothesis.

INSIGHTS:

  • The results of the chi-square test indicate that there is sufficient evidence to conclude that there is a difference in average ratings between podcasts and the number of people voting for them.
  • The results of the HSD and ANOVA tests indicate that there is sufficient evidence to conclude that there are differences in average ratings between podcasts categories. Specifically Arts category received significantly higher ratings compared to other categories, while Government, Health & Fitness, News & Politics, and True Crime categories received significantly lower ratings than other categories.
  • February and June tend to have higher average ratings compared to other months. October and November tend to have lower average ratings compared to other months. There is no significant difference in ratings between February and June, indicating that both months have similarly high ratings. Similarly, there is no significant difference in ratings between October and November, indicating that both months have similarly low ratings.


⇡¶

CONCLUSIONS

¶

SUMMARY:

  1. Difference in Podcast Ratings and Voting Counts:

    • The chi-square test reveals a significant difference between average ratings and the number of people voting for podcasts. This suggests that while some podcasts may receive high ratings, the number of votes can vary significantly, indicating potential disparities in audience engagement and reach.
  2. Variation in Average Ratings Across Podcast Categories:

    • The HSD and ANOVA tests demonstrate notable differences in average ratings across different podcast categories. Categories such as Arts tend to receive higher ratings, whereas Government, Health & Fitness, News & Politics, and True Crime categories receive lower ratings on average. Understanding these variations can help in tailoring content and marketing strategies to better suit audience preferences within each category.
  3. Seasonal Trends in Podcast Ratings:

    • Monthly analysis reveals variations in average ratings across different months. February and June exhibit higher average ratings compared to other months, while October and November tend to have lower ratings. Further investigation into the reasons behind these seasonal trends can provide insights for content scheduling and promotion strategies.

¶

INSIGHTS:

  1. The significant difference between podcast ratings and voting counts suggests that while some podcasts may have high satisfaction levels among listeners, their reach and engagement might not align with their quality. This calls for strategies to enhance visibility and engagement for high-quality podcasts.

¶

POTENTIAL AREAS FOR INVESTIGATION:

Observations Relevant to Stakeholders:

  1. Explore factors influencing the observed differences in voting counts among podcasts with similar average ratings. This could involve analyzing promotion strategies, audience demographics, or platform visibility.
  2. Explore audience feedback and preferences to identify areas for content improvement. This could involve analyzing listener reviews, ratings, and engagement metrics to understand what resonates most with the audience and adjust content strategies accordingly.

Observations Relevant to Analysts (processing would require more data):

  1. Examine additional variables that may contribute to the observed variations in average ratings across podcast categories, such as content format, host expertise, or audience demographics.
  2. Further investigate the impact of promotional activities, release schedules, and episode lengths on podcast ratings to optimize content strategies for maximizing audience satisfaction and engagement.